Multivariate data exploration

GEOG 30323

February 11, 2020

Why visualize data?

"The greatest value of a picture is when it forces us to notice what we never expected to see."

  • Tukey (1977), quoted in Yau (2013)

Exploring data visually

Source: Yau, Data Points p. 137

Our schedule:

  • Current activities: data exploration through visualization with common chart types
  • Weeks 10-14: deep dive into data visualization
    • More complex chart types
    • How to customize your seaborn plots
    • Best practices in data visualization
    • Interactive web-based graphics
    • Maps!

Exploratory chart types

  • Comparing categories: bar chart, dot plot
  • Part-to-whole: pie chart
  • Change over time: line chart
  • Connections and relationships: scatter plot

Each category includes many more chart types; these are just our focus for today!

Python and the web

  • A brief aside: With Python, data on the web is at your fingertips (our topic for Week 9)
  • This week, you will get a preview

Comparing categories

How about sorting our data?

Bar charts

Source: FiveThirtyEight.com

Bar charts

  • Length or height of bars proportional to data values, allowing for comparisons between categories
  • The value axis of bar charts must start at zero! Otherwise, bar lengths exaggerate the differences between categories
  • Recommendation: sort your data values for ease of interpretation

Bar chart with non-zero origin

Source: Fox News via FlowingData.com

Bar charts in Python
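
A minimal sketch of a sorted bar chart drawn with the pandas plotting interface, which uses matplotlib underneath. The region names and values are made up for illustration and are not from the course data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.Series(
    {"North": 42, "South": 17, "East": 29, "West": 35},
    name="units_sold",
)

# Sort so the bars read from largest to smallest.
sales_sorted = sales.sort_values(ascending=False)

# Draw vertical bars; for positive values the value axis starts at zero.
ax = sales_sorted.plot.bar(rot=0)
ax.set_xlabel("Region")
ax.set_ylabel("Units sold")
```

Using plot.barh instead of plot.bar gives horizontal bars, which can be easier to read when category labels are long.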

Bar charts in seaborn
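
A comparable sketch with seaborn's barplot function, assuming a data frame with one row per category. The column names and values are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "units_sold": [42, 17, 29, 35],
})

# Sort first; seaborn keeps the row order for string categories.
df_sorted = df.sort_values("units_sold", ascending=False)

# Numeric x and categorical y give horizontal bars.
ax = sns.barplot(data=df_sorted, x="units_sold", y="region", color="steelblue")
ax.set(xlabel="Units sold", ylabel="Region")
```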

Dot plots

Source: FiveThirtyEight.com

Dot plots

  • Can be preferable to bar charts: values are encoded by position along an axis rather than by bar length
  • In turn, zero origin not strictly necessary (though consider the context)
  • Sorted data also preferable for dot plots

Dot plots in seaborn
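
One way to sketch a dot plot in seaborn is stripplot with jitter turned off, so that each category gets a single point positioned by its value. This is an assumption about approach, not necessarily the method shown in class, and the data are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "units_sold": [42, 17, 29, 35],
})

# Sorted rows make the dots easier to compare.
df_sorted = df.sort_values("units_sold", ascending=False)

# jitter=False keeps each dot centered on its category row.
ax = sns.stripplot(
    data=df_sorted, x="units_sold", y="region",
    size=10, jitter=False, color="steelblue",
)
ax.xaxis.grid(True)  # light guide lines help read positions
ax.set(xlabel="Units sold", ylabel="Region")
```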

Part-to-whole

  • Categories in relationship to the entire population of values
  • Examples: pie chart, waffle chart, 100% bar chart, tree map
  • The parts must sum to 100%!

Pie charts in Python
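
A minimal sketch of a pie chart with matplotlib, converting counts to percentage shares first so the parts sum to 100%. The commuting categories and counts are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts for illustration only.
counts = pd.Series(
    {"Car": 60, "Transit": 25, "Bike": 10, "Walk": 5},
    name="commuters",
)

# Convert counts to shares of the whole, in percent.
shares = counts / counts.sum() * 100

fig, ax = plt.subplots()
ax.pie(shares, labels=list(shares.index), autopct="%1.0f%%", startangle=90)
ax.set_aspect("equal")  # keep the pie circular
```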

Problems with pie charts

Source: Fox Chicago via FlowingData.com

Problems with pie charts

Source: Wikimedia Commons

Line charts

Source: FiveThirtyEight.com

Line charts in seaborn

Scatter plots

  • Question: how do the values in two columns covary?
  • Scatter plot: each observation represented by a point; position along x axis dictated by one column value; position along y axis dictated by other column value
  • Regression line: visual representation of estimated statistical relationship between X and Y

Scatter plots

Source: FiveThirtyEight.com

Scatter plots in seaborn
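
A minimal sketch of a scatter plot with seaborn's scatterplot function, where one column sets the x position and another sets the y position. The columns and values are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "income": [30, 45, 50, 62, 75, 90],
    "rent": [700, 850, 900, 1100, 1200, 1500],
})

# Each row becomes one point.
ax = sns.scatterplot(data=df, x="income", y="rent")
ax.set(xlabel="Income (thousands)", ylabel="Monthly rent")
```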

Scatter plots in seaborn

  • Scatter plots with a fitted regression line are also available through the lmplot and regplot functions
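
A sketch of a scatter plot with a regression line using regplot, on hypothetical data. By default regplot fits a straight line and draws a bootstrapped confidence band around it.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "income": [30, 45, 50, 62, 75, 90],
    "rent": [700, 850, 900, 1100, 1200, 1500],
})

# Points plus a fitted linear regression line on a single Axes.
ax = sns.regplot(data=df, x="income", y="rent")
ax.set(xlabel="Income (thousands)", ylabel="Monthly rent")
```

lmplot takes the same data, x, and y arguments but is a figure-level function: it returns a FacetGrid and can split the plot by hue, col, or row.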

Correlation

  • Correlation coefficient: statistical measure of how two variables covary; ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship
  • In pandas: .corr()
  • Beware of spurious correlations! http://tylervigen.com/spurious-correlations
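
A small sketch of .corr() in pandas on a toy data frame built so that the extremes are easy to see. The column names are hypothetical; .corr() computes the Pearson coefficient by default.

```python
import pandas as pd

# Toy data: y_up rises exactly with x, y_down falls exactly with x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y_up": [2, 4, 6, 8, 10],
    "y_down": [10, 8, 6, 4, 2],
})

# Correlation between two columns (Series.corr).
r_up = df["x"].corr(df["y_up"])      # approximately +1
r_down = df["x"].corr(df["y_down"])  # approximately -1

# Pairwise correlation matrix for all numeric columns (DataFrame.corr).
matrix = df.corr()
print(matrix)
```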